Multi-Label Classification
Multi-label classification addresses problems where each instance can simultaneously belong to multiple classes rather than just one. Unlike multi-class classification where classes are mutually exclusive, multi-label problems allow any subset of labels to be active. Applications include document tagging (a news article can be both "politics" and "economics"), image annotation (a photo containing "beach," "sunset," and "person"), gene function prediction, and music categorization by multiple genres.
Problem transformation methods adapt multi-label tasks for standard classifiers. Binary Relevance trains an independent binary classifier for each label, treating each as a separate yes/no question; this is simple but ignores label correlations. Classifier Chains orders the labels and trains sequential classifiers, each using the predictions for earlier labels as additional features; this captures dependencies, but the chain order must be chosen (often arbitrarily) and can affect the result. Label Powerset treats each unique combination of labels as a distinct class, preserving correlations but potentially creating exponentially many classes, many of which have few examples.
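A minimal sketch of the three transformations, assuming scikit-learn, a synthetic dataset from make_multilabel_classification, and logistic regression as the base learner (all illustrative choices rather than recommendations):

```python
# Binary Relevance, Classifier Chains, and a hand-rolled Label Powerset encoding.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

X, Y = make_multilabel_classification(n_samples=1000, n_features=20,
                                      n_classes=5, n_labels=2, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# Binary Relevance: one independent logistic regression per label.
br = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_train, Y_train)

# Classifier Chain: each classifier also sees the predictions for earlier labels.
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order=None, random_state=0).fit(X_train, Y_train)

# Label Powerset: encode each unique label combination as one multi-class target.
combos, y_powerset = np.unique(Y_train, axis=0, return_inverse=True)
lp = LogisticRegression(max_iter=1000).fit(X_train, y_powerset)
Y_pred_lp = combos[lp.predict(X_test)]   # map predicted classes back to label sets
```

Note the Label Powerset weakness mentioned above shows up directly here: it can only predict label combinations that appear in the training data (the rows of `combos`).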
Algorithm adaptation methods modify existing algorithms to handle multiple labels natively. Decision trees can be adapted to predict label sets at leaves. Neural networks naturally support multi-label output using sigmoid activation on each output node (rather than softmax), allowing independent probability estimates per label. Threshold optimization becomes important—each label may need its own decision threshold rather than using 0.5 uniformly.
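A minimal sketch of the neural approach, assuming PyTorch; the two-layer architecture, training loop, and per-label thresholds below are placeholders rather than tuned values:

```python
# Multi-label network: one sigmoid output per label, binary cross-entropy loss,
# and a separate decision threshold for each label.
import torch
import torch.nn as nn

n_features, n_labels = 20, 5
model = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, n_labels),      # one logit per label, no softmax
)
loss_fn = nn.BCEWithLogitsLoss()  # applies the sigmoid internally, one term per label
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(128, n_features)                     # stand-in feature batch
Y = torch.randint(0, 2, (128, n_labels)).float()     # stand-in binary label matrix

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), Y)
    loss.backward()
    optimizer.step()

probs = torch.sigmoid(model(X))                      # independent probability per label
thresholds = torch.tensor([0.5, 0.3, 0.5, 0.6, 0.4]) # per-label; tune on validation data
predictions = (probs > thresholds).int()
```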
Evaluation metrics must account for partial correctness. Hamming Loss measures the fraction of labels incorrectly predicted (either false positives or false negatives). Subset Accuracy requires exact match of the entire label set (strictest metric). F1-score can be computed per label then averaged (macro/micro). Jaccard Index (intersection over union) measures similarity between predicted and true label sets. Different applications prioritize different metrics based on whether missing labels or adding incorrect labels is more costly.
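All of these metrics are available in scikit-learn; the two small label matrices below are made up purely to show the calls:

```python
# Hamming loss, subset accuracy, macro/micro F1, and sample-wise Jaccard index.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss, jaccard_score

Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

print(hamming_loss(Y_true, Y_pred))                     # fraction of wrong label cells
print(accuracy_score(Y_true, Y_pred))                   # subset accuracy: exact-set matches only
print(f1_score(Y_true, Y_pred, average="macro"))        # F1 per label, then averaged
print(f1_score(Y_true, Y_pred, average="micro"))        # F1 from counts pooled across labels
print(jaccard_score(Y_true, Y_pred, average="samples")) # intersection over union per sample
```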
Label correlation is a key consideration. Some label combinations occur frequently (e.g., "beach" and "ocean" in images) while others are mutually exclusive or rare. Methods that exploit these correlations often outperform binary relevance. Label space dimensionality reduction can project labels into lower-dimensional spaces when dealing with hundreds or thousands of potential labels, learning embeddings that capture label relationships.
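One simple way to inspect correlations is a label co-occurrence matrix, and a truncated SVD of the label matrix gives a basic form of label space dimensionality reduction. The sketch below assumes scikit-learn and a random stand-in label matrix; the component count is arbitrary:

```python
# Label co-occurrence counts plus SVD-based compression of a large label space.
import numpy as np
from sklearn.decomposition import TruncatedSVD

Y = np.random.binomial(1, 0.2, size=(1000, 300))   # stand-in matrix with 300 labels

# Entry (i, j) counts how often labels i and j are active in the same instance.
cooccurrence = Y.T @ Y

# Compress labels to 32 dimensions; a model can be trained to predict the
# compressed targets, and predictions mapped back with the inverse transform.
svd = TruncatedSVD(n_components=32, random_state=0)
Y_compressed = svd.fit_transform(Y)                # (n_samples, 32) reduced targets
Y_reconstructed = svd.inverse_transform(Y_compressed)
Y_decoded = (Y_reconstructed > 0.5).astype(int)    # back to binary label predictions
```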
Popular Algorithms
- Binary Relevance - Trains independent binary classifier for each label; simple but ignores label correlations
- Classifier Chains - Sequential binary classifiers where each uses previous predictions as features; captures label dependencies
- Label Powerset - Treats each unique label combination as a single class; preserves correlations but faces combinatorial explosion
- ML-KNN - Adapts k-nearest neighbors for multi-label by examining label frequencies among neighbors (see the sketch after this list)
- Random Forest (Multi-label) - Can be adapted to predict multiple labels at leaf nodes or used with problem transformation
- Neural Networks (Multi-label) - Use sigmoid activation on output layer allowing independent probabilities per label; trained with binary cross-entropy loss
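As referenced above, a simplified sketch of the ML-KNN idea using scikit-learn's NearestNeighbors: a label is predicted when it is active for at least half of the k nearest training neighbors. The helper name and threshold are illustrative, and the full ML-KNN algorithm additionally applies a Bayesian (MAP) correction to these neighbor frequencies:

```python
# Neighbor-frequency approximation of ML-KNN.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_label_frequencies(X_train, Y_train, X_test, k=5):
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_test)             # indices of the k nearest training points
    freqs = Y_train[idx].mean(axis=1)          # per-label frequency among those neighbors
    return (freqs >= 0.5).astype(int), freqs   # thresholded label sets and raw scores
```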